
perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712)#1729

Merged

brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-12-aa0e4ef9-03 on May 12, 2026

Conversation

@brendancol
Contributor

Summary

_nvcomp_batch_compress had two anti-patterns left over after the
decode-side batched D2H fix in #1552 and the device-pointer fix in
#1659:

  1. Per-tile cupy.empty allocations for compressed-output buffers.
    For an N-tile write this issued N memory-pool calls. Replace with one
    contiguous allocation sized n_tiles * max_cs plus per-tile slab
    views; the per-tile pointer array still lets nvCOMP write
    independent slabs in parallel. _check_gpu_memory now bounds the
    new allocation.

  2. Per-tile .get().tobytes() in the result-collection loop. Each
    .get() was a separate D2H transfer on the default stream, and the
    per-DMA setup cost dominated wall time for large-N writes (exactly
    the pattern that #1552, "Batch GDS-fallback per-tile GPU to host
    transfer", fixed on the decode side). Replace with one
    cupy.concatenate of trimmed slabs and one .get(), then slice
    the host buffer by cumulative offsets. The adler32 deflate-wrap step
    stays on the host as before. (A minimal sketch of both changes
    follows this list.)
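
For orientation only, here is a minimal sketch of the two changes. The function and parameter names (batched_compress_readback, comp_sizes_host) are hypothetical, and the nvCOMP call itself is elided; this is not the literal body of _nvcomp_batch_compress.

import cupy
import numpy as np

def batched_compress_readback(n_tiles, max_cs, comp_sizes_host):
    # (1) One contiguous device pool instead of n_tiles cupy.empty calls.
    # The PR bounds n_tiles * max_cs via _check_gpu_memory before this point.
    d_pool = cupy.empty(n_tiles * max_cs, dtype=cupy.uint8)
    d_comp_bufs = [d_pool[i * max_cs:(i + 1) * max_cs] for i in range(n_tiles)]
    # nvCOMP still receives one device pointer per tile, so it can write
    # the independent slabs in parallel.
    d_ptrs = cupy.asarray([b.data.ptr for b in d_comp_bufs], dtype=cupy.uint64)

    # ... nvCOMP batch compress writes into the slabs and reports the
    # per-tile compressed sizes collected in comp_sizes_host ...

    # (2) One packed D2H copy instead of n_tiles .get() calls: trim each
    # slab to its compressed size, concatenate on-device, copy once, then
    # slice the host bytes by cumulative offsets.
    trimmed = [d_comp_bufs[i][:int(comp_sizes_host[i])] for i in range(n_tiles)]
    host = bytes(cupy.concatenate(trimmed).get())
    offsets = np.concatenate([[0], np.cumsum(comp_sizes_host)]).astype(int)
    return [host[offsets[i]:offsets[i + 1]] for i in range(n_tiles)]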

Real-world benchmark on an Ampere-class GPU: 2048x2048 float32 zstd
GPU write at tile_size=64 (1024 tiles) drops median from 84.3 ms to
54.7 ms (~35% reduction). The decode-side fix in #1552 measured a
similar 256 ms -> 38 ms drop on the analogous LZW path; this PR brings
the encode side in line with it.

Closes #1712.

Test plan

  • test_no_per_tile_cupy_empty_in_compressed_pool and test_no_per_tile_get_in_result_loop: source-level guards that fail loudly if either anti-pattern returns (a rough illustration of this style of guard follows the list).
  • test_gpu_write_roundtrip_after_batched_compress[deflate] and [zstd]: end-to-end GPU write + read produces the same float32 array, catching any off-by-one in the host-side cumulative offsets.
  • test_gpu_write_zero_tile_edge_case: tiny 32x32 input still round-trips through the fast path.
  • Existing 53 GPU writer tests in test_gpu_writer_compression_modes_2026_05_11.py, test_gpu_writer_attrs_1563.py, test_gpu_writer_band_first_1580.py, test_gpu_writer_nan_sentinel_1599.py, test_to_geotiff_gpu_fallback_1674.py, test_nvcomp_from_device_bufs_single_alloc_1659.py all pass.
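
For illustration, one of the source-level guards could look roughly like the following; the actual assertions in test_nvcomp_batch_compress_batched_1712.py may be stricter, and the exact string checked is an assumption.

import inspect

from xrspatial.geotiff import _gpu_decode


def test_no_per_tile_get_in_result_loop():
    # Each per-tile .get().tobytes() call was a separate D2H transfer; the
    # batched path replaces them with one concatenate plus a single .get(),
    # so the old pattern reappearing in the source is a regression marker.
    src = inspect.getsource(_gpu_decode._nvcomp_batch_compress)
    assert ".get().tobytes()" not in src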

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 19:14
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the GeoTIFF GPU write path’s nvCOMP batch compression by removing per-tile device allocations and per-tile device→host readbacks, bringing the encode side in line with the already-batched decode patterns.

Changes:

  • Replace per-tile cupy.empty compressed-output allocations with a single contiguous device pool + per-tile slab views/pointers.
  • Replace per-tile .get().tobytes() readback with a packed cupy.concatenate(...) + single .get() and host slicing.
  • Add a new test module intended to prevent regressions back to the per-tile allocation/readback patterns and to round-trip deflate/zstd writes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
xrspatial/geotiff/_gpu_decode.py | Implements the contiguous compressed-output pool and batched D2H readback in _nvcomp_batch_compress.
xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py | Adds structural/source assertions and end-to-end round-trip tests for the updated batching behavior.

Comments suppressed due to low confidence (2)

xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:106

  • These round-trip tests don’t currently guarantee that the nvCOMP GPU compress path actually ran: write_geotiff_gpu falls back to CPU codecs when _nvcomp_batch_compress returns None and does not necessarily raise RuntimeError, so the test can pass while never exercising the changed code. To ensure coverage, skip up-front unless nvCOMP is available/usable (e.g., _gpu_decode._get_nvcomp() is not None, and/or assert the writer took the nvCOMP branch via a spy/monkeypatch/strict-mode signal).
    with tempfile.TemporaryDirectory(prefix="nvcomp_batch_1712_") as td:
        path = os.path.join(td, f"roundtrip_{compression}.tif")
        try:
            write_geotiff_gpu(
                darr, path,
                compression=compression,
                tiled=True,
                tile_size=64,
            )
        except RuntimeError as e:
            # nvCOMP may be unavailable in this environment; the writer
            # falls back to CPU and that path doesn't exercise the
            # change. Skip rather than fail.
            pytest.skip(f"nvCOMP unavailable for {compression}: {e}")

xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:124

  • The docstring says this is a “0-tile compress” edge case, but the test exercises a 32x32 single-tile write/read instead and even notes that n_tiles==0 never occurs via the writer. Either adjust the docstring to match what’s being tested (tiny single-tile round-trip) or add a direct unit-level test that actually hits the n_tiles == 0 branch in _nvcomp_batch_compress (e.g., via a stub nvCOMP handle).
def test_gpu_write_zero_tile_edge_case():
    """A 0-tile compress returns an empty list without indexing into None.

    The cumulative-sum / concat path must short-circuit before
    ``cupy.concatenate`` (which would raise on an empty list). The
    pre-fix loop simply iterated zero times, so the contract is the
    same empty-list output.
    """
    # Direct call into the internal function with n_tiles=0 is
    # awkward because it needs a libnvCOMP handle and matching opts.
    # Instead, exercise the public writer with a tiny single-tile
    # input and confirm the fast path does not crash. Real n_tiles==0
    # never occurs via the writer (every image has at least one tile).
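
The short-circuit the docstring refers to would amount to a guard like the following inside _nvcomp_batch_compress; its exact placement and the surrounding names are assumptions, not the repository's code.

    if n_tiles == 0:
        # cupy.concatenate raises on an empty sequence, so return the same
        # empty list the pre-fix per-tile loop produced by iterating zero times.
        return []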


Comment on lines +2555 to +2559
trimmed = [
    d_comp_bufs[i][:int(comp_sizes[i])] for i in range(n_tiles)
]
d_comp_concat = cupy.concatenate(trimmed)
host_concat = bytes(d_comp_concat.get())
Comment on lines +24 to +34
try:
    import cupy
    _HAS_CUPY = True
except Exception:
    _HAS_CUPY = False

# nvCOMP is the entry point that exercises this code path.
from xrspatial.geotiff import _gpu_decode

needs_cupy = pytest.mark.skipif(
    not _HAS_CUPY, reason="CUDA / cupy unavailable on this host"
)

perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712)

Two related anti-patterns in the GPU compress path remained after the
decode-side fix in #1552 and the device-pointer fix in #1659:

1. Per-tile cupy.empty allocations for the compressed-output buffers.
   For an N-tile write, this issued N memory-pool calls. Replace with
   one contiguous allocation of size n_tiles * max_cs plus per-tile
   slab views; the per-tile pointer array still lets nvCOMP write
   independent slabs in parallel.

2. Per-tile .get().tobytes() in the result-collection loop. Each
   .get() was a separate D2H transfer on the default stream, and the
   per-DMA setup cost dominated wall time for large-N writes (the
   exact pattern #1552 fixed on the decode side). Replace with a
   single cupy.concatenate of trimmed slabs and one .get(), then
   slice the host buffer by cumulative offsets to peel out per-tile
   payloads. The adler32 deflate-wrap step is unchanged.

Real-world benchmark on an Ampere-class GPU: 2048x2048 float32 zstd
GPU write at tile_size=64 (1024 tiles) drops median from 84.3ms to
54.7ms (~35% reduction).

Tests cover the structural change (regressions back to the per-tile
patterns fail loudly) plus end-to-end round-trip equality at
deflate and zstd compression. _check_gpu_memory now bounds the new
contiguous allocation in the same way the decode-side fix does.
@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-12-aa0e4ef9-03 branch from 958e8f4 to 14cf257 Compare May 12, 2026 19:29
- Add _check_gpu_memory guard before nvcomp staging-buffer concatenate
  (mirrors _batched_d2h_to_bytes pattern)
- Tighten GPU skip to require cupy.cuda.is_available()
@brendancol brendancol merged commit 635246c into main May 12, 2026
10 of 11 checks passed

Labels

performance PR touches performance-sensitive code


Development

Successfully merging this pull request may close these issues.

perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations

2 participants